Protein Folding in CLP(FD) with Empirical Contact Energies
Abstract
We present a declarative implementation in Constraint Logic Programming of the protein folding problem, for models based on face-centered cubes. Constraints are used either to encode the problem as a minimization problem or to prune the search space. In particular, we introduce constraints that use secondary structure information.

The protein folding problem is the problem of predicting the 3D structure of a protein (its native conformation) when the linear sequence of amino acids identifying the protein is known. It is widely accepted that the native conformation ensures a state of minimum free energy. The energy function associated with a conformation depends on the distances between every pair of amino acids and on their types. Recently, a new potential matrix has been developed that gives the energy between two amino acids in contact as a function of their types [1]. Thus, the problem can be reduced to minimizing this function over the admissible 3D conformations of the protein. The decision version of this problem has been proved NP-complete. However, the problem is of crucial importance in biology and biotechnology and deserves to be attacked, and the moderate typical length of a protein is encouraging.

The high accuracy of secondary structure predictions suggests including them in protein folding prediction. Another important structural feature of proteins is the ability of cysteine residues to bind covalently through their sulphur atoms, forming disulfide bridges that impose important contact constraints (also known as ssbonds). This information reduces the search space and, consequently, the time needed to identify the minimum-energy conformation.

Constraint Programming [3] is a programming methodology well suited to encoding combinatorial optimization problems. We use Constraint Logic Programming over finite domains to model the protein folding problem on the face-centered cubic lattice [4].
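To make the notions of admissible conformation and contact energy concrete, here is a minimal Python sketch. It is not the CLP(FD) encoding of this work. It assumes the common embedding of the face-centered cubic lattice as integer points with an even coordinate sum, in which consecutive amino acids sit at squared distance 2 and a contact at distance 2 corresponds to squared distance 4. The names `is_admissible`, `contact_energy`, `min_separation`, and the values in `TOY_POTENTIAL` are hypothetical; the real potential matrix is the one of [1].

```python
from itertools import combinations

# The 12 nearest-neighbour steps of the FCC lattice in the assumed embedding
# (integer points whose coordinates sum to an even number).
FCC_STEPS = [
    (dx, dy, dz)
    for dx in (-1, 0, 1)
    for dy in (-1, 0, 1)
    for dz in (-1, 0, 1)
    if abs(dx) + abs(dy) + abs(dz) == 2
]

# Hypothetical toy energies keyed by unordered residue pair; NOT the matrix of [1].
TOY_POTENTIAL = {frozenset("C"): -5, frozenset("CG"): -1}


def sq_dist(p, q):
    """Squared Euclidean distance between two lattice points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def is_admissible(coords):
    """True if coords (a list of integer 3-tuples) is a self-avoiding FCC walk."""
    on_lattice = all(sum(p) % 2 == 0 for p in coords)
    chained = all(sq_dist(p, q) == 2 for p, q in zip(coords, coords[1:]))
    self_avoiding = len(set(coords)) == len(coords)
    return on_lattice and chained and self_avoiding


def contact_energy(sequence, coords, potential, contact_sq_dist=4, min_separation=3):
    """Sum the pair energies over all residue pairs that are in contact.

    min_separation (an assumption of this sketch) excludes pairs that are
    close along the chain from being counted as contacts.
    """
    total = 0
    for i, j in combinations(range(len(coords)), 2):
        if j - i >= min_separation and sq_dist(coords[i], coords[j]) == contact_sq_dist:
            pair = frozenset((sequence[i], sequence[j]))
            total += potential.get(pair, 0)
    return total


# A four-residue toy chain whose two end residues are in contact.
EXAMPLE_COORDS = [(0, 0, 0), (1, 1, 0), (1, 2, 1), (0, 2, 0)]
EXAMPLE_ENERGY = contact_energy("CGGC", EXAMPLE_COORDS, TOY_POTENTIAL)
```

In this toy chain only the first and last residues satisfy the contact criterion, so `EXAMPLE_ENERGY` is just the toy C-C entry. A solver for the folding problem would search over admissible coordinate lists for the one that minimizes this sum.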
We encode the problem as a minimization problem, using a discrete lattice-based energy function. In detail, two amino acids are in contact (and contribute to the global energy) if they lie at a distance of 2 lattice units in the folding. This definition corresponds to a contact distance of 5.4 Å, while much larger distances have been used by other authors [6]. We add constraints obtained from secondary structure prediction to reach acceptable computation times. In particular, we develop a local coordinate system to define torsional angles, which links the secondary structure information efficiently to the three-dimensional folding. Moreover, we developed a new method to prune the search tree dynamically, based on an analysis of the contacts between amino acids during the folding process. We study the effectiveness of the method on some small proteins whose structure is known.

A subproblem of the whole problem, based on the HP-model, is studied and successfully solved for proteins of length 160 in [7]. The high level of abstraction of that model does not ensure that the result is the native conformation; in particular, local sub-conformations such as α-helices or β-strands are often lost. To give a practical idea of the differences among models and experimental results, we depict the real folding of the protein Protegrin 1 (PDB id: 1PG1, Fig. 1) and the foldings predicted with the HP-model (Fig. 2) and with our empirical contact energies model (Fig. 3). Note that the HP-model usually returns a compressed protein and often predicts bend angles of 60°, which are not common in nature.

Figure 1. Protein 1PG1 from PDB
Figure 2. Protein 1PG1, HP-model, 6.9 Å RMSD error
Figure 3. Protein 1PG1, our model, 3.3 Å RMSD error

The results we obtained allow us to effectively predict proteins of up to 60 amino acids. We compare the previous results in [2] with the ones in this work. For a small protein such as 1LE3 (16 amino acids), we obtain a 75-fold speedup in computation time, and the protein energy is improved by a factor of more than 3. For longer proteins, e.g. 1ED0 (46 amino acids), the speedup is 270 and the energy is slightly improved. Note that for the protein 1ED0 we reach the physical memory limit of our machine, while the computation time is still tractable.

We have tested the program on the same set of proteins used in [2]. Computational results are reported in the table, where “b” stands for ssbond, “s” for strand, and “h” for helix. In the protein model systems 1LE3, 1PG1, and 1ZDD, terminal protecting groups have been neglected.

Experimental results

Name  N   Secondary Information                                        Time    RMSD
1LE0  12  [s(2,4),s(9,11)]                                             4s      2.9 Å
1KVG  12  [b(2,11),s(2,4),s(9,11)]                                     2s      3.6 Å
1LE3  16  [s(2,6),s(11,15)]                                            5s      3.0 Å
1EDP  17  [b(1,15),b(3,11),h(9,15)]                                    26s     3.1 Å
1PG1  18  [b(6,15),b(8,13),s(4,9),s(12,17)]                            0.8s    3.3 Å
1ZDD  34  [b(5,34),h(3,13),h(20,33)]                                   41s     3.9 Å
1VII  36  [h(4,8),h(15,18),h(23,32)]                                   6m 56s  7.2 Å
1E0M  37  [s(7,12),s(18,22),s(27,29)]                                  9m 45s  6.0 Å
2GP8  40  [h(6,21),h(26,38)]                                           9s      5.1 Å
1ED0  46  [b(3,40),b(4,32),b(16,26),h(7,18),h(23,30),s(2,4),s(33,34)]  2m 33s  8.9 Å
1ENH  54  [h(8,20),h(26,36),h(40,52),s(22,23)]                         13m     13.3 Å

We have used SICStus PROLOG 3.10.0 (http://www.sics.se/sicstus/) and a PC with an AMD Duron 1000 MHz processor. The encouraging results suggest how to enhance the heuristic phase by introducing more sophisticated techniques for handling already computed substrings as high-level units. In the near future we plan to implement centroids of side chains [5], to improve the RMSD errors using the same lattice, to redefine the contact so that it matches the contact energy definition more closely, and to study an automatic procedure to suggest contacts between secondary structure subsequences.
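The secondary-structure annotations listed with the experimental results use a compact notation such as [b(2,11),s(2,4),s(9,11)]. As an illustration only, the following Python sketch parses that notation and checks the indices against the protein length N. It assumes that indices are 1-based residue positions and that the two numbers are given in increasing order; the helper names `parse_secondary_info` and `check_annotations` are hypothetical and are not part of the Prolog program described here.

```python
import re

# Letter codes as defined in the text: b = ssbond, s = strand, h = helix.
KINDS = {"b": "ssbond", "s": "strand", "h": "helix"}

# Matches one item such as "b(2,11)", tolerating spaces around the numbers.
_ITEM = re.compile(r"([bsh])\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_secondary_info(text):
    """Parse '[b(2,11),s(2,4)]' into a list of (kind, first, second) tuples."""
    return [(KINDS[k], int(a), int(b)) for k, a, b in _ITEM.findall(text)]


def check_annotations(annotations, length):
    """Return the annotations whose indices are not 1 <= first < second <= length."""
    return [ann for ann in annotations if not (1 <= ann[1] < ann[2] <= length)]


# Entry 1KVG from the results (N = 12).
KVG = parse_secondary_info("[b(2,11),s(2,4), s(9,11)]")
KVG_PROBLEMS = check_annotations(KVG, 12)
```

For an ssbond the two numbers are read as the bridged residues, and for a strand or helix as the first and last residue of the segment. A check of this kind is a cheap way to catch transcription errors before the annotations are turned into constraints.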
Similar resources
Using Secondary Structure Information for Protein Folding in CLP(FD)
The protein folding problem is the problem of predicting the 3D structure of a protein when the linear sequence of amino acids identifying it is known. In this paper we present a declarative implementation in Constrain & Generate style in CLP(FD) of the protein folding problem, for models based on face-centered cubes. We use information concerning secondary structure (and other heuristics) to s...
Contact order dependent protein folding rates: kinetic consequences of a cooperative interplay between favorable nonlocal interactions and local conformational preferences.
Physical mechanisms underlying the empirical correlation between relative contact order (CO) and folding rate among naturally occurring small single-domain proteins are investigated by evaluating postulated interaction schemes for a set of three-dimensional 27mer lattice protein models with 97 different CO values. Many-body interactions are constructed such that contact energies become more fav...
Understanding protein folding with energy landscape theory. Part II: Quantitative aspects.
5. Thermodynamics and kinetics of protein folding 234 5.1 A protein Hamiltonian with cooperative interactions 234 5.2 Variance of native contact energies 235 5.3 Thermodynamics of protein folding 236 5.4 Free-energy surfaces and dynamics for a Hamiltonian with pair-wise interactions 240 5.5 The effects of cooperativity on folding 242 5.6 Transition-state drift 242 5.7 Phase diagram for a model ...
Energy Study at Different Temperatures for Active Site of Azurin in Water, Ethanol, Methanol and Gas Phase by Monte Carlo Simulations
The interaction between the solute and the solvent molecules plays a crucial role in understanding the various molecular processes involved in chemistry and biochemistry, so in this work the potential energy of the active site of azurin has been calculated in solvent by Monte Carlo simulation. In this paper we present quantitative results of Monte Carlo calculations of potential energies of ...
Validity of Gō models: comparison with a solvent-shielded empirical energy decomposition.
Do Gō-type model potentials provide a valid approach for studying protein folding? They have been widely used for this purpose because of their simplicity and the speed of simulations based on their use. The essential assumption in such models is that only contact interactions existing in the native state determine the energy surface of a polypeptide chain, even for non-native configurations sa...